1. Data preparation¶

In [1]:
# sentiment categories
categories = ['nostalgia', 'not nostalgia']
In [2]:
# download data from web

import pandas as pd
df = pd.read_csv("hf://datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data.csv")
In [3]:
# observe data
print(df)
X = df.rename(columns={'sentiment': 'sentiment_name'})
          sentiment                                            comment
0     not nostalgia  He was a singer with a golden voice that I lov...
1         nostalgia  The mist beautiful voice ever I listened to hi...
2         nostalgia  I have most of Mr. Reeves songs.  Always love ...
3     not nostalgia  30 day leave from 1st tour in Viet Nam to conv...
4         nostalgia  listening to his songs reminds me of my mum wh...
...             ...                                                ...
1495  not nostalgia  i don't know!..but the opening of the video,.....
1496  not nostalgia  it's sad this is such a beautiful song when yo...
1497  not nostalgia  Dear Friend, I think age and time is not that ...
1498      nostalgia  I was born in 1954 and started to be aware of ...
1499      nostalgia  This is the first CD I bought after my marriag...

[1500 rows x 2 columns]
In [4]:
# my helper functions
import helpers_homework.data_mining_helpers as dmh

# convert the category labels into numbers
X['sentiment'] = X['sentiment_name'].apply(lambda t: dmh.format_labels_number(t, X))
# reorder the columns
X = X[['sentiment', 'comment', 'sentiment_name']]
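`dmh.format_labels_number` is a course helper whose source is not shown here; judging from the output in the next cell, it maps each label to a number (nostalgia → 0, not nostalgia → 1). A minimal pandas sketch of that mapping, assuming this fixed encoding:

```python
import pandas as pd

# Hypothetical stand-in for dmh.format_labels_number: map each label
# to a fixed number (nostalgia -> 0, not nostalgia -> 1).
label_to_number = {'nostalgia': 0, 'not nostalgia': 1}

toy = pd.DataFrame({'sentiment_name': ['nostalgia', 'not nostalgia', 'nostalgia']})
toy['sentiment'] = toy['sentiment_name'].map(label_to_number)
print(toy['sentiment'].tolist())  # [0, 1, 0]
```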
In [5]:
X[0:10]
Out[5]:
sentiment comment sentiment_name
0 1 He was a singer with a golden voice that I lov... not nostalgia
1 0 The mist beautiful voice ever I listened to hi... nostalgia
2 0 I have most of Mr. Reeves songs. Always love ... nostalgia
3 1 30 day leave from 1st tour in Viet Nam to conv... not nostalgia
4 0 listening to his songs reminds me of my mum wh... nostalgia
5 0 Every time I heard this song as a child, I use... nostalgia
6 0 My dad loved listening to Jim Reeves, when I w... nostalgia
7 0 i HAVE ALSO LISTENED TO Jim Reeves since child... nostalgia
8 1 Wherever you are you always in my heart not nostalgia
9 1 Elvis will always be number one no one can com... not nostalgia

2. Data Mining¶

2.1 Missing Data processing¶

In [6]:
# check whether the data contains any missing values
X.isnull().apply(lambda x: dmh.check_missing_values(x))
Out[6]:
sentiment comment sentiment_name
0 The amount of missing records is: The amount of missing records is: The amount of missing records is:
1 0 0 0

There are no missing values in the data, so no imputation is needed.
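`dmh.check_missing_values` is a course helper; pandas' built-in `isnull().sum()` yields the same per-column counts directly. A toy sketch:

```python
import pandas as pd

# One missing value per column in this toy frame.
toy = pd.DataFrame({'sentiment': [0, 1, None], 'comment': ['a', None, 'c']})
missing_per_column = toy.isnull().sum()  # count of NaNs in each column
print(missing_per_column.to_dict())  # {'sentiment': 1, 'comment': 1}
```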

2.2 Dealing with Duplicate Data¶

In [7]:
# check for duplicate rows
sum(X.duplicated())
Out[7]:
1
In [8]:
# locate the duplicate rows
X[X.duplicated(keep=False)]
Out[8]:
sentiment comment sentiment_name
62 1 never heard this song before... WOW What an am... not nostalgia
78 1 never heard this song before... WOW What an am... not nostalgia
In [9]:
X.drop_duplicates(keep='first', inplace=True) # drop duplicate rows, keeping the first occurrence
X.reset_index(drop=True, inplace=True) # rebuild the index
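The `duplicated` / `drop_duplicates` pattern used above can be sketched on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'comment': ['wow', 'great', 'wow']})
assert toy.duplicated().sum() == 1            # only the second 'wow' is flagged
print(toy[toy.duplicated(keep=False)])        # keep=False marks both copies
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)
print(len(deduped))  # 2
```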
In [10]:
# print again to confirm
X
Out[10]:
sentiment comment sentiment_name
0 1 He was a singer with a golden voice that I lov... not nostalgia
1 0 The mist beautiful voice ever I listened to hi... nostalgia
2 0 I have most of Mr. Reeves songs. Always love ... nostalgia
3 1 30 day leave from 1st tour in Viet Nam to conv... not nostalgia
4 0 listening to his songs reminds me of my mum wh... nostalgia
... ... ... ...
1494 1 i don't know!..but the opening of the video,..... not nostalgia
1495 1 it's sad this is such a beautiful song when yo... not nostalgia
1496 1 Dear Friend, I think age and time is not that ... not nostalgia
1497 0 I was born in 1954 and started to be aware of ... nostalgia
1498 0 This is the first CD I bought after my marriag... nostalgia

1499 rows × 3 columns

3. Data processing¶

3.1 Sampling¶

In [11]:
X_sample = X.sample(n=750) # no random_state is fixed, so this draw varies between runs
In [12]:
import matplotlib.pyplot as plt
%matplotlib inline

# class counts for the full dataset and the sample
X_counts = X.sentiment_name.value_counts()
X_sample_counts = X_sample.sentiment_name.value_counts()

# collect all categories and align both count series to the same order
all_categories = X_counts.index
X_sample_counts = X_sample_counts.reindex(all_categories, fill_value=0)

# bar width and positions
bar_width = 0.2
index = range(len(all_categories))

# bars for the full dataset
plt.bar(index, X_counts, bar_width, label='Dataset X')

# bars for the sample, shifted to the right
plt.bar([i + bar_width for i in index], X_sample_counts, bar_width, label='Dataset X_sample')

# title and tick labels
plt.title('Category distribution')
plt.xticks([i + bar_width / 2 for i in index], all_categories, rotation=0)

# legend and display
plt.legend()
plt.show()

The two classes are distributed almost exactly 1:1. The full dataset has one record fewer than the original 1500 overall because the duplicate comment (a not-nostalgia record) was removed.
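`X.sample(n=750)` draws rows uniformly, so the near 1:1 split here is preserved only because the classes are already balanced. If a guaranteed ratio were needed, a stratified draw (sketched below with pandas' `groupby(...).sample`, available since pandas 1.1) would enforce it:

```python
import pandas as pd

toy = pd.DataFrame({'sentiment_name': ['a'] * 6 + ['b'] * 6})
# sample half of each class, so the class ratio is preserved exactly
half = toy.groupby('sentiment_name', group_keys=False).sample(frac=0.5, random_state=0)
print(half['sentiment_name'].value_counts().to_dict())  # {'a': 3, 'b': 3}
```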

3.2 Feature Creation¶

In [13]:
import nltk
In [14]:
# tokenization takes a minute or two to process
X['unigrams'] = X['comment'].apply(lambda x: dmh.tokenize_text(x))
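`dmh.tokenize_text` is a course helper whose implementation is not shown; a minimal regex-based stand-in (an assumption, not the helper's actual logic) might look like:

```python
import re

def tokenize_text_sketch(text):
    """Hypothetical stand-in for dmh.tokenize_text: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize_text_sketch("He was a singer with a golden voice"))
# ['he', 'was', 'a', 'singer', 'with', 'a', 'golden', 'voice']
```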

3.3 CountVectorizer¶

3.3.1 Feature subset selection¶

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.comment)
feature_terms = count_vect.get_feature_names_out()
tdm_df = pd.DataFrame(X_counts.toarray(), columns=feature_terms, index=X.index) 
In [16]:
X_counts.shape # 1499 documents, 3730 features
Out[16]:
(1499, 3730)
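What `CountVectorizer` builds here, a document-term count matrix over a sorted vocabulary, can be sketched by hand on a toy corpus (tokenization simplified to whitespace splitting):

```python
# Hand-rolled sketch of a document-term count matrix: one row per document,
# one column per vocabulary word, cell = count of that word in that document.
docs = ["golden voice", "voice of gold golden"]
vocab = sorted({w for d in docs for w in d.split()})
matrix = [[d.split().count(w) for w in vocab] for d in docs]
print(vocab)   # ['gold', 'golden', 'of', 'voice']
print(matrix)  # [[0, 1, 0, 1], [1, 1, 1, 1]]
```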
In [17]:
# inspect a small slice of the current feature matrix
plot_x = ["term_"+str(i) for i in feature_terms[0:20]]
plot_y = ["doc_"+ str(i) for i in list(X.index)[0:20]]
plot_z = X_counts[0:20, 0:20].toarray() # first 20 documents x first 20 terms
In [18]:
# visualize the slice as a heatmap
import seaborn as sns

df_todraw = pd.DataFrame(plot_z, columns=plot_x, index=plot_y)
plt.subplots(figsize=(5, 5))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd", # pink-toned colormap
                 vmin=0, vmax=1, annot=True) # annot shows the value in each cell
plt.show()

The first 20 features appear only rarely in the first 20 documents.

3.3.2 Attribute Transformation¶

In [19]:
# compute the total frequency of every feature
import numpy as np
term_frequencies = np.asarray(X_counts.sum(axis=0))[0] # sum down the columns: per-term totals
In [20]:
term_frequencies # total count of every feature
feature_terms # all feature names (only this last expression is echoed below)
Out[20]:
array(['00', '000', '045', ..., 'yup', 'zealand', 'zulus'], dtype=object)

Observations on the raw counts

In [21]:
# plot the distribution of the first 300 features (raw counts)
plt.close() # close any previous figure
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=feature_terms[:300], y=term_frequencies[:300])
g.set_xticks(range(300))  # set the x-axis tick positions
g.set_xticklabels(feature_terms[:300], rotation=90);
plt.show()
In [22]:
# interactive view of the raw counts (top 300, descending)
import plotly.express as px

plt.close()
data = pd.DataFrame({'Terms': feature_terms, 'Frequencies': term_frequencies})
top_data = data.nlargest(300, 'Frequencies').sort_values(by='Frequencies', ascending=False) # the 300 most frequent terms
fig = px.bar(top_data, x='Terms', y='Frequencies', title='Top 300 Most Frequent Terms', text='Frequencies')
fig.update_traces(texttemplate='%{text}', textposition='outside') 
fig.update_layout(xaxis_tickangle=-90)
fig.show()

Observations on the log-transformed counts

In [23]:
# log-transform the term frequencies
import math
term_frequencies_log = [math.log(i) for i in term_frequencies]
In [24]:
# plot the distribution of the first 300 features (log counts)
plt.close() # close any previous figure
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=feature_terms[:300], y=term_frequencies_log[:300])
g.set_xticks(range(300))  # set the x-axis tick positions
g.set_xticklabels(feature_terms[:300], rotation=90);
plt.show()
In [25]:
# interactive view of the log counts (top 300, descending)
import plotly.express as px

plt.close()
data = pd.DataFrame({'Terms': feature_terms, 'Frequencies': term_frequencies_log})
top_data = data.nlargest(300, 'Frequencies').sort_values(by='Frequencies', ascending=False) # the 300 largest log frequencies
fig = px.bar(top_data, x='Terms', y='Frequencies', title='Top 300 Most Frequent Terms', text='Frequencies')
fig.update_traces(texttemplate='%{text}', textposition='outside') 
fig.update_layout(xaxis_tickangle=-90)
fig.show()

3.3.3 Attribute Aggregation¶

Extracting the raw features of each category¶

In [26]:
category_dfs = {}
for category in categories:
    category_dfs[category] = X[X['sentiment_name'] == category].copy()
In [27]:
# define a function that builds a term-document DataFrame
def create_term_document_df_CountVector(df,min_df=0.0,max_df=1.0):
    count_vect_temp = CountVectorizer(min_df=min_df, max_df=max_df) # Initialize the CountVectorizer
    X_counts_temp = count_vect_temp.fit_transform(df['comment'])  # Transform the text data into word counts
    words_temp = count_vect_temp.get_feature_names_out()
    term_document_df_temp = pd.DataFrame(X_counts_temp.toarray(), columns=words_temp)
    return term_document_df_temp
In [28]:
# build a separate feature set for each of the two categories
filt_term_document_dfs = {} 
for category in categories:
    filt_term_document_dfs[category] = create_term_document_df_CountVector(category_dfs[category])
In [29]:
# display the term-document DataFrames
for category in categories:
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(filt_term_document_dfs[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
     07  10  11  11th  12  13  14  15  16  17  ...  young  younger  youngster  \
0     0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
1     0   0   0     0   0   0   0   0   0   1  ...      0        0          0   
2     0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
3     0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
4     0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
..   ..  ..  ..   ...  ..  ..  ..  ..  ..  ..  ...    ...      ...        ...   
745   0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
746   0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
747   0   0   0     0   0   1   0   0   0   0  ...      0        0          0   
748   0   0   0     0   0   0   0   0   0   0  ...      0        0          0   
749   0   0   0     0   0   0   0   0   0   0  ...      0        0          0   

     your  yours  youth  youthful  youtube  yrs  yup  
0       0      0      0         0        0    0    0  
1       0      0      0         0        0    0    0  
2       0      0      0         0        0    0    0  
3       1      0      0         0        0    0    0  
4       0      0      0         0        0    0    0  
..    ...    ...    ...       ...      ...  ...  ...  
745     0      0      0         0        0    0    1  
746     0      0      0         0        0    0    0  
747     0      0      0         0        0    0    0  
748     0      0      0         0        0    0    0  
749     0      0      0         0        0    0    0  

[750 rows x 2295 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
     00  000  045  10  100  10m  11  12  14  15  ...  youngest  youngsters  \
0     0    0    0   0    0    0   0   0   0   0  ...         0           0   
1     0    0    0   0    0    0   0   0   0   0  ...         0           0   
2     0    0    0   0    0    0   0   0   0   0  ...         0           0   
3     0    0    0   0    0    0   0   0   0   0  ...         0           0   
4     0    0    0   0    0    0   0   0   0   0  ...         0           0   
..   ..  ...  ...  ..  ...  ...  ..  ..  ..  ..  ...       ...         ...   
744   0    0    0   0    0    0   0   0   0   0  ...         0           0   
745   0    0    0   0    0    0   0   0   0   0  ...         0           0   
746   0    0    0   0    0    0   0   0   0   0  ...         0           0   
747   0    0    0   0    0    0   0   0   0   0  ...         0           0   
748   0    0    0   0    0    0   0   0   0   0  ...         0           0   

     your  yourself  youth  youtube  yrs  yuo  zealand  zulus  
0       0         0      0        0    0    0        0      0  
1       0         0      0        0    0    0        0      0  
2       0         0      0        0    0    0        0      0  
3       0         0      0        0    0    0        0      0  
4       0         0      0        0    0    0        0      0  
..    ...       ...    ...      ...  ...  ...      ...    ...  
744     0         0      0        0    0    0        0      0  
745     0         0      0        0    0    0        0      0  
746     0         0      0        0    0    0        0      0  
747     1         0      0        0    0    0        0      0  
748     0         0      0        0    0    0        0      0  

[749 rows x 2602 columns]

So nostalgia starts with 2295 features and not nostalgia with 2602.

In [30]:
for category in categories:
    word_counts = filt_term_document_dfs[category].sum(axis=0).to_numpy()
    plt.close()
    plt.figure(figsize=(10, 6))
    plt.hist(word_counts, bins=100,color='blue', edgecolor='black')
    plt.title(f'Term Frequency Distribution for Category {category}')
    plt.xlabel('Frequency')
    plt.ylabel('Number of Terms')
    plt.xlim(1, 200)
    plt.show()

The histograms for both classes show that most features occur only a handful of times; we will drop both the extremely rare terms and the overly frequent ones.

Removing the lowest- and highest-frequency features from the raw data¶

In [31]:
# drop the features with the smallest and largest total counts

def filter_top_bottom_words_by_sum(term_document_df, top_percent=0.05, bottom_percent=0.01):

    word_sums = term_document_df.sum(axis=0)
    sorted_words = word_sums.sort_values()

    total_words = len(sorted_words)
    top_n = int(top_percent * total_words) # number of high-frequency words to drop
    bottom_n = int(bottom_percent * total_words) # number of low-frequency words to drop

    words_to_remove = pd.concat([sorted_words.head(bottom_n), sorted_words.tail(top_n)]).index
    # print(f'Bottom {bottom_percent*100}% words: \n{sorted_words.head(bottom_n)}') # words in the filtered bottom percentage
    # print(f'Top {top_percent*100}% words: \n{sorted_words.tail(top_n)}') # words in the filtered top percentage

    return term_document_df.drop(columns=words_to_remove)
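The trimming logic can be verified on a toy term-document matrix where the top and bottom fractions each select exactly one word:

```python
import pandas as pd

# Toy term-document matrix: trim the single most frequent and the single
# least frequent column, mirroring filter_top_bottom_words_by_sum when the
# top/bottom fractions each select one word.
tdm = pd.DataFrame({'rare': [1, 0, 0], 'mid': [2, 1, 1], 'common': [5, 6, 7]})
sums = tdm.sum(axis=0).sort_values()             # rare=1, mid=4, common=18
drop = pd.concat([sums.head(1), sums.tail(1)]).index
print(tdm.drop(columns=drop).columns.tolist())   # ['mid']
```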
In [32]:
# drop the extremes for each category
term_document_dfs = {} 

for category in categories:
    print(f'\nFor category {category} we filter the following words:')
    term_document_dfs[category] = filter_top_bottom_words_by_sum(filt_term_document_dfs[category])
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(term_document_dfs[category])
For category nostalgia we filter the following words:
Filtered Term-Document Frequency DataFrame for Category nostalgia:
     07  10  11  11th  12  13  14  15  16  17  ...  yo  yokel  younger  \
0     0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
1     0   0   0     0   0   0   0   0   0   1  ...   0      0        0   
2     0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
3     0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
4     0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
..   ..  ..  ..   ...  ..  ..  ..  ..  ..  ..  ...  ..    ...      ...   
745   0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
746   0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
747   0   0   0     0   0   1   0   0   0   0  ...   0      0        0   
748   0   0   0     0   0   0   0   0   0   0  ...   0      0        0   
749   0   0   0     0   0   0   0   0   0   0  ...   0      0        0   

     youngster  your  yours  youth  youthful  youtube  yrs  
0            0     0      0      0         0        0    0  
1            0     0      0      0         0        0    0  
2            0     0      0      0         0        0    0  
3            0     1      0      0         0        0    0  
4            0     0      0      0         0        0    0  
..         ...   ...    ...    ...       ...      ...  ...  
745          0     0      0      0         0        0    0  
746          0     0      0      0         0        0    0  
747          0     0      0      0         0        0    0  
748          0     0      0      0         0        0    0  
749          0     0      0      0         0        0    0  

[750 rows x 2159 columns]

For category not nostalgia we filter the following words:
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
     000  045  10  100  10m  11  12  14  15  150  ...  younger  youngest  \
0      0    0   0    0    0   0   0   0   0    0  ...        0         0   
1      0    0   0    0    0   0   0   0   0    0  ...        0         0   
2      0    0   0    0    0   0   0   0   0    0  ...        0         0   
3      0    0   0    0    0   0   0   0   0    0  ...        0         0   
4      0    0   0    0    0   0   0   0   0    0  ...        0         0   
..   ...  ...  ..  ...  ...  ..  ..  ..  ..  ...  ...      ...       ...   
744    0    0   0    0    0   0   0   0   0    0  ...        0         0   
745    0    0   0    0    0   0   0   0   0    0  ...        0         0   
746    0    0   0    0    0   0   0   0   0    0  ...        0         0   
747    0    0   0    0    0   0   0   0   0    0  ...        0         0   
748    0    0   0    0    0   0   0   0   0    0  ...        0         0   

     youngsters  yourself  youth  youtube  yrs  yuo  zealand  zulus  
0             0         0      0        0    0    0        0      0  
1             0         0      0        0    0    0        0      0  
2             0         0      0        0    0    0        0      0  
3             0         0      0        0    0    0        0      0  
4             0         0      0        0    0    0        0      0  
..          ...       ...    ...      ...  ...  ...      ...    ...  
744           0         0      0        0    0    0        0      0  
745           0         0      0        0    0    0        0      0  
746           0         0      0        0    0    0        0      0  
747           0         0      0        0    0    0        0      0  
748           0         0      0        0    0    0        0      0  

[749 rows x 2446 columns]

After filtering, the feature counts drop from 2602 to 2446 (not nostalgia) and from 2295 to 2159 (nostalgia).

In [33]:
# save the selected features as transactional databases (CSV) for later use
from PAMI.extras.convert.DF2DB import DF2DB

for category in term_document_dfs:
    category_safe = category.replace(' ', '_')
    obj = DF2DB(term_document_dfs[category])
    obj.convert2TransactionalDatabase(f'./td_freq_db/td_freq_db_{category_safe}.csv', '>=', 1)
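`convert2TransactionalDatabase` with `('>=', 1)` writes, for each document, the list of terms whose count is at least 1; a document with no surviving terms produces no transaction, which is why the transactional databases below contain fewer transactions (734 and 745) than documents (750 and 749). A sketch of that conversion:

```python
import pandas as pd

# Sketch of the transactional conversion: each row becomes the list of
# column names whose count is >= 1; empty rows yield no transaction.
tdm = pd.DataFrame({'love': [1, 0, 2], 'song': [0, 0, 1], 'sad': [0, 0, 0]})
transactions = [
    row.index[row >= 1].tolist() for _, row in tdm.iterrows() if (row >= 1).any()
]
print(transactions)  # [['love'], ['love', 'song']]
```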
In [34]:
# observe the transactional datasets
from PAMI.extras.dbStats import TransactionalDatabase as tds

def observe_transactional_database(name):
    plt.close()
    name = name.replace(' ', '_')
    obj = tds.TransactionalDatabase(f'./td_freq_db/td_freq_db_{name}.csv')
    print(f'Transactional Dataset {name}:')
    obj.run()
    obj.printStats()
    obj.plotGraphs()
    plt.show()
In [35]:
for category in categories:
    observe_transactional_database(category)
Transactional Dataset nostalgia:
Database size (total no of transactions) : 734
Number of items : 2159
Minimum Transaction Size : 1
Average Transaction Size : 8.693460490463215
Maximum Transaction Size : 39
Standard Deviation Transaction Size : 7.213372063492091
Variance in Transaction Sizes : 52.10372252435774
Sparsity : 0.9959733855996001
Transactional Dataset not_nostalgia:
Database size (total no of transactions) : 745
Number of items : 2446
Minimum Transaction Size : 1
Average Transaction Size : 8.410738255033557
Maximum Transaction Size : 46
Standard Deviation Transaction Size : 5.926429722323316
Variance in Transaction Sizes : 35.16977700801039
Sparsity : 0.9965614316210002

Building augmented_df with FPGrowth, FAE topK, and MaxFPGrowth¶

In [36]:
# use FPGrowth with minsup
from PAMI.frequentPattern.basic import FPGrowth as alg

def FPGrowth_minsup(minSup,name):
    obj = alg.FPGrowth(iFile=f'./td_freq_db/td_freq_db_{name}.csv', minSup=minSup)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_minsup/freq_patterns_{name}_minSup{minSup}.txt') #save the patterns
    return frequentPatternsDF_temp
In [37]:
# use FAE topK
from PAMI.frequentPattern.topk import FAE
def FAE_topK(k,name):
    obj = FAE.FAE(iFile=f'./td_freq_db/td_freq_db_{name}.csv', k=k)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_topK/freq_patterns_{name}_topK{k}.txt') #save the patterns
    return frequentPatternsDF_temp
In [38]:
# use MaxFPGrowth with minsup
from PAMI.frequentPattern.maximal import MaxFPGrowth as algm

def FPGrowth_max(minSup,name):
    obj = algm.MaxFPGrowth(iFile=f'./td_freq_db/td_freq_db_{name}.csv', minSup=minSup)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_max/freq_patterns_{name}_max_minSup{minSup}.txt') #save the patterns
    return frequentPatternsDF_temp
In [39]:
def pattern_integrate(frequentPatternsDF):

    dfs = []
    for category in categories:
        dfs.append(frequentPatternsDF[category])
        
    combined_df = pd.concat(dfs, ignore_index=True)
    pattern_counts = combined_df['Patterns'].value_counts()
    unique_patterns = pattern_counts[pattern_counts == 1].index
    final_pattern_df = combined_df[combined_df['Patterns'].isin(unique_patterns)].sort_values(by='Support', ascending=False)
    # print(final_pattern_df) 
    # print(f"Number of patterns discarded: {(len(pattern_counts) - len(unique_patterns))*2}")  # Count of discarded patterns

    return final_pattern_df
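The discrimination step in `pattern_integrate` relies on `value_counts`: a pattern mined in both classes appears twice after the concat, so keeping only counts equal to 1 retains class-specific patterns. On toy data:

```python
import pandas as pd

# Patterns mined in both classes ('song') appear twice after concat and
# are discarded; patterns unique to one class survive.
a = pd.DataFrame({'Patterns': ['favorite', 'song'], 'Support': [30, 10]})
b = pd.DataFrame({'Patterns': ['elvis', 'song'], 'Support': [21, 8]})
combined = pd.concat([a, b], ignore_index=True)
counts = combined['Patterns'].value_counts()
unique = counts[counts == 1].index
print(sorted(unique.tolist()))  # ['elvis', 'favorite']
```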
In [40]:
def augmented_df_generation(final_pattern_df):
    X['tokenized_comment'] = X['comment'].str.split().apply(set)
    pattern_matrix = pd.DataFrame(0, index=X.index, columns=final_pattern_df['Patterns'])
    
    for pattern in final_pattern_df['Patterns']:
        pattern_words = set(pattern.split())  # Tokenize pattern into words
        pattern_matrix[pattern] = X['tokenized_comment'].apply(lambda x: 1 if pattern_words.issubset(x) else 0)
        
    augmented_df = pd.concat([tdm_df, pattern_matrix], axis=1) # join the mined patterns onto the original term-document matrix

    return augmented_df 
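The matching step above marks a pattern as present when every word of the pattern occurs in the comment's token set, regardless of word order or adjacency:

```python
# A pattern "fires" for a comment when all of its words appear in the
# comment's token set; order and adjacency are ignored.
comment_tokens = set("i loved this song as a kid".split())
print({'loved', 'song'}.issubset(comment_tokens))   # True  -> matrix cell 1
print({'elvis', 'song'}.issubset(comment_tokens))   # False -> matrix cell 0
```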
In [41]:
# mine the training features with FPGrowth, minSup = 3

frequentPatternsDF_minsup = {}

for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_minsup[category] = FPGrowth_minsup(3,category_save)
    print (frequentPatternsDF_minsup[category])
    
final_pattern_df_minsup = pattern_integrate(frequentPatternsDF_minsup)
augmented_df_minsup = augmented_df_generation(final_pattern_df_minsup)
augmented_df_minsup
Frequent patterns were generated successfully using frequentPatternGrowth algorithm
nostalgia
Total No of patterns: 948
Runtime: 0.030440568923950195
          Patterns  Support
0           forgot        3
1               mr        3
2       appreciate        3
3            death        3
4        death jim        3
..             ...      ...
943          would       28
944           will       28
945  will favorite        3
946             go       28
947       favorite       30

[948 rows x 2 columns]
Frequent patterns were generated successfully using frequentPatternGrowth algorithm
not_nostalgia
Total No of patterns: 730
Runtime: 0.014939546585083008
         Patterns  Support
0       emotional        3
1             fan        3
2              30        3
3        blessing        3
4        december        3
..            ...      ...
725       classic       21
726          them       21
727    them every        4
728        lyrics       21
729  lyrics every        3

[730 rows x 2 columns]
Out[41]:
00 000 045 07 10 100 10m 11 11th 12 ... later ever later year later been make cry make where make them hearing away missed today country favorite lyrics every
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1494 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1495 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1496 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1497 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1498 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1499 rows × 4784 columns

In [42]:
# mine the training features with FAE topK, k = 800

frequentPatternsDF_topK = {}

for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_topK[category] = FAE_topK(800,category_save)
    print (frequentPatternsDF_topK[category])
    
final_pattern_df_topK = pattern_integrate(frequentPatternsDF_topK)
augmented_df_topK = augmented_df_generation(final_pattern_df_topK)
augmented_df_topK
 TopK frequent patterns were successfully generated using FAE algorithm.
nostalgia
Total No of patterns: 800
Runtime: 0.39636778831481934
         Patterns  Support
0        favorite       30
1            ever       28
2           would       28
3            will       28
4              go       28
..            ...      ...
795      over get        3
796  over country        3
797    over which        3
798      over pop        3
799  over perfect        3

[800 rows x 2 columns]
 TopK frequent patterns were successfully generated using FAE algorithm.
not_nostalgia
Total No of patterns: 800
Runtime: 0.282914400100708
       Patterns  Support
0         elvis       21
1         every       21
2         loved       21
3       classic       21
4          them       21
..          ...      ...
795  difference        2
796        nine        2
797        slap        2
798     naughty        2
799       needs        2

[800 rows x 2 columns]
Out[42]:
00 000 045 07 10 100 10m 11 11th 12 ... fall describes compose memorable genre amazingly sweetest arms cruel needs
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1494 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1495 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1496 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1497 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1498 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1499 rows × 4700 columns

In [43]:
# mine the training features with MaxFPGrowth, minSup = 3

frequentPatternsDF_max = {}

for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_max[category] = FPGrowth_max(3,category_save)
    print (frequentPatternsDF_max[category])
    
final_pattern_df_max = pattern_integrate(frequentPatternsDF_max)
augmented_df_max = augmented_df_generation(final_pattern_df_max)
augmented_df_max
Maximal Frequent patterns were generated successfully using MaxFp-Growth algorithm 
nostalgia
Total No of patterns: 682
Runtime: 0.03948616981506348
           Patterns  Support
0          skating         3
1           walker         3
2            scott         3
3          17 1987         3
4             stop         3
..              ...      ...
677      will such         4
678      ever only         3
679     would only         4
680       ever kid         3
681  favorite will         3

[682 rows x 2 columns]
Maximal Frequent patterns were generated successfully using MaxFp-Growth algorithm 
not_nostalgia
Total No of patterns: 592
Runtime: 0.0324559211730957
          Patterns  Support
0          thinks         3
1          months         3
2       currently         3
3            kids         3
4            wait         3
..             ...      ...
587          days        20
588  every lyrics         3
589    every them         4
590       classic        21
591         loved        21

[592 rows x 2 columns]
Out[43]:
00 000 045 07 10 100 10m 11 11th 12 ... wish could see ever boy us too been too about too listened singer no singer well singer since singer since got
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1494 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1495 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1496 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1497 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1498 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1499 rows × 4710 columns

3.3.4 Dimensionality Reduction¶

2D by PCA, t-SNE, UMAP¶

In [44]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
In [45]:
col = ['coral', 'blue']
In [46]:
def Dimensionality_2D(now_df):
    X_pca = PCA(n_components=2).fit_transform(now_df.values)
    X_tsne = TSNE(n_components=2).fit_transform(now_df.values)
    X_umap = umap.UMAP(n_components=2).fit_transform(now_df.values)
    return X_pca, X_tsne, X_umap
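Note that `PCA` is deterministic while `TSNE` and `umap.UMAP` are stochastic (their layouts change between runs unless a `random_state` is passed). What `PCA(n_components=2).fit_transform` computes can be sketched with plain NumPy (center, then project onto the top right-singular vectors):

```python
import numpy as np

# Minimal PCA sketch: center the data, take the SVD, and project onto the
# first two right-singular vectors (the top two principal components).
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 5))
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T   # scores on the first two components
print(projected.shape)  # (20, 2)
```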
In [47]:
def plot_scatter(ax, X_reduced, title):
    for c, category in zip(col, categories):
        xs = X_reduced[X['sentiment_name'] == category].T[0]
        ys = X_reduced[X['sentiment_name'] == category].T[1]
        ax.scatter(xs, ys, c=c, marker='o', label=category)
    
    ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
    ax.set_title(title)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend(loc='upper right')
In [48]:
def draw_2D_plt(X_pca, X_tsne, X_umap):
    plt.close()
    fig, axes = plt.subplots(1, 3, figsize=(30, 10))  
    fig.suptitle('PCA, t-SNE, and UMAP Comparison')
    plot_scatter(axes[0], X_pca, 'PCA')
    plot_scatter(axes[1], X_tsne, 't-SNE')
    plot_scatter(axes[2], X_umap, 'UMAP')
    plt.show()
In [49]:
X_pca, X_tsne, X_umap = Dimensionality_2D(tdm_df)
In [50]:
draw_2D_plt(X_pca, X_tsne, X_umap)
In [51]:
X_pca, X_tsne, X_umap = Dimensionality_2D(augmented_df_minsup) # minsup FPGrowth
In [52]:
X_pca.shape
Out[52]:
(1499, 2)
In [53]:
draw_2D_plt(X_pca, X_tsne, X_umap)

3D by PCA, t-SNE, UMAP¶

In [54]:
from mpl_toolkits.mplot3d import Axes3D
In [55]:
def Dimensionality_3D(now_df):
    X_pca = PCA(n_components=3).fit_transform(now_df.values)
    X_tsne = TSNE(n_components=3).fit_transform(now_df.values)
    X_umap = umap.UMAP(n_components=3).fit_transform(now_df.values)
    return X_pca, X_tsne, X_umap
In [56]:
X_pca_minsup, X_tsne_minsup, X_umap_minsup = Dimensionality_3D(augmented_df_minsup)
In [57]:
X_pca_topK, X_tsne_topK, X_umap_topK = Dimensionality_3D(augmented_df_topK)
In [58]:
X_pca_max, X_tsne_max, X_umap_max = Dimensionality_3D(augmented_df_max)
In [59]:
angle_3D = [[0, 15, 90], [0, 60, 120]]

# define a function to create a 3D scatter plot
def plot_scatter_3d(ax, X_reduced, title):
    for c, category in zip(col, categories):
        xs = X_reduced[X['sentiment_name'] == category][:, 0]
        ys = X_reduced[X['sentiment_name'] == category][:, 1]
        zs = X_reduced[X['sentiment_name'] == category][:, 2]
        ax.scatter(xs, ys, zs, c=c, marker='o', label=category)

    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.legend(loc='upper right')
    ax.set_title(title)
In [60]:
# 3D PCA figures
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure holding nine subplots
fig.suptitle('PCA 3D from Three Angles and Three Augmented DataFrames')

# first row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3x3 grid
    plot_scatter_3d(ax, X_pca_minsup, f'PCA_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# second row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')  # next three subplots
    plot_scatter_3d(ax, X_pca_topK, f'PCA_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# third row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')  # last three subplots
    plot_scatter_3d(ax, X_pca_max, f'PCA_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
plt.show()
[figure: 3×3 grid of PCA 3D scatter plots]
In [61]:
# build the 3-D model figures (t-SNE)
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure holding nine subplots
fig.suptitle('t-SNE 3D from Three Angles and Three Augmented DataFrames')

# first group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3 rows x 3 columns
    plot_scatter_3d(ax, X_tsne_minsup, f't-SNE_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# second group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')  # next three cells of the grid
    plot_scatter_3d(ax, X_tsne_topK, f't-SNE_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# third group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')  # last three cells of the grid
    plot_scatter_3d(ax, X_tsne_max, f't-SNE_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
plt.show()
[figure: 3×3 grid of t-SNE 3D scatter plots]
In [62]:
# build the 3-D model figures (UMAP)
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure holding nine subplots
fig.suptitle('UMAP 3D from Three Angles and Three Augmented DataFrames')

# first group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3 rows x 3 columns
    plot_scatter_3d(ax, X_umap_minsup, f'UMAP_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# second group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')  # next three cells of the grid
    plot_scatter_3d(ax, X_umap_topK, f'UMAP_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle

# third group of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')  # last three cells of the grid
    plot_scatter_3d(ax, X_umap_max, f'UMAP_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
plt.show()
[figure: 3×3 grid of UMAP 3D scatter plots]

The first row shows the FP-Growth features with minsup = 3. The two classes are hard to tell apart in the PCA and t-SNE panels, while the UMAP panel separates them somewhat, with the blue points gathered on one side and the orange on the other.

The second row shows the top-800 selected features (topK). Its t-SNE panel is the most clearly separated of all; the other panels look much like the first row.

The third row shows the maximal FP-Growth features with minsup = 3. The PCA panel is as inseparable as with the previous two methods, the t-SNE panel is even denser with no visible separation, but in the UMAP panel the third viewing angle clearly shows the two colors settling on opposite sides.
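These visual impressions can also be checked numerically. A minimal sketch (assuming scikit-learn is available), using the silhouette score on synthetic 3-D point clouds: scores near 1 indicate well-separated classes, scores near 0 indicate overlap, matching what the panels above suggest qualitatively.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two synthetic 3-D "embeddings": one well separated, one overlapping
separated = np.vstack([rng.normal(0, 0.5, (50, 3)), rng.normal(5, 0.5, (50, 3))])
overlapping = np.vstack([rng.normal(0, 2.0, (50, 3)), rng.normal(1, 2.0, (50, 3))])
labels = np.array([0] * 50 + [1] * 50)

print(silhouette_score(separated, labels))    # close to 1
print(silhouette_score(overlapping, labels))  # close to 0
```

The same call on the real embeddings, e.g. `silhouette_score(X_umap_minsup, X['sentiment'])`, would quantify the separation seen in the UMAP panels.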

3.3.5 Discretization and Binarization¶

In [63]:
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
In [64]:
mlb = preprocessing.LabelBinarizer()
In [65]:
mlb.fit(X.sentiment)
Out[65]:
LabelBinarizer()
In [66]:
X['bin_sentiment'] = mlb.transform(X['sentiment']).tolist()
In [67]:
X[0:9]
Out[67]:
sentiment comment sentiment_name unigrams tokenized_comment bin_sentiment
0 1 He was a singer with a golden voice that I lov... not nostalgia [He, was, a, singer, with, a, golden, voice, t... {love, emotional, at, great, vouch, You, age, ... [1]
1 0 The mist beautiful voice ever I listened to hi... nostalgia [The, mist, beautiful, voice, ever, I, listene... {love, Never, I, when, voice, The, an, and, hi... [0]
2 0 I have most of Mr. Reeves songs. Always love ... nostalgia [I, have, most, of, Mr., Reeves, songs, ., Alw... {so, love, comforting, people, sounds, up, wer... [0]
3 1 30 day leave from 1st tour in Viet Nam to conv... not nostalgia [30, day, leave, from, 1st, tour, in, Viet, Na... {back, be, me", God, 30, receive., "marry, 1st... [1]
4 0 listening to his songs reminds me of my mum wh... nostalgia [listening, to, his, songs, reminds, me, of, m... {reminds, of, songs, played, listening, mum, m... [0]
5 0 Every time I heard this song as a child, I use... nostalgia [Every, time, I, heard, this, song, as, a, chi... {death,, reminded, got, child,, time, song., E... [0]
6 0 My dad loved listening to Jim Reeves, when I w... nostalgia [My, dad, loved, listening, to, Jim, Reeves, ,... {back, changes, listening, Time, loved, do, I,... [0]
7 0 i HAVE ALSO LISTENED TO Jim Reeves since child... nostalgia [i, HAVE, ALSO, LISTENED, TO, Jim, Reeves, sin... {love, 71, he, nostalgic, LISTENED, I, ALSO, J... [0]
8 1 Wherever you are you always in my heart not nostalgia [Wherever, you, are, you, always, in, my, heart] {Wherever, in, you, my, are, always, heart} [1]
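Since this is a two-class problem, `LabelBinarizer` emits a single 0/1 column, which is why `bin_sentiment` holds one-element lists such as `[0]` and `[1]`. A minimal self-contained sketch:

```python
# minimal LabelBinarizer sketch for a two-class problem (scikit-learn):
# each label becomes a single 0/1 column, so transform() yields
# one-element rows like [0] and [1]
from sklearn.preprocessing import LabelBinarizer

mlb_demo = LabelBinarizer()
binarized = mlb_demo.fit_transform([1, 0, 0, 1])
print(binarized.ravel().tolist())             # [1, 0, 0, 1]
print(mlb_demo.inverse_transform(binarized))  # recovers the original labels
```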

3.4 TF-IDF¶

3.4.1 Feature subset selection¶

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [69]:
TFIDF_vect = TfidfVectorizer()
X_TFIDF = TFIDF_vect.fit_transform(X.comment)
TFIDF_terms = TFIDF_vect.get_feature_names_out()
TFIDF_df = pd.DataFrame(X_TFIDF.toarray(), columns=TFIDF_terms, index=X.index) 
In [70]:
TFIDF_df.shape
Out[70]:
(1499, 3730)
In [71]:
# take a quick look at a slice of the current feature matrix
plot_x = ["term_" + str(i) for i in TFIDF_terms[0:20]]
plot_y = ["doc_" + str(i) for i in list(X.index)[0:20]]
plot_z = X_TFIDF[0:20, 0:20].toarray()  # X_TFIDF[document slice, term slice]
In [72]:
# visualize with a heatmap
import seaborn as sns

df_todraw = pd.DataFrame(plot_z, columns=plot_x, index=plot_y)
plt.subplots(figsize=(10, 5))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd",  # pink-toned colormap
                 vmin=0, annot=True)  # annot shows the value in each cell
plt.show()
[figure: heatmap of TF-IDF values for the first 20 terms × first 20 documents]

The first 20 features carry very little signal for the first 20 documents: nearly every TF-IDF entry in this slice is zero.
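The near-empty heatmap reflects how sparse the TF-IDF matrix is overall. A small sketch (on hypothetical toy documents) of measuring the fraction of zero entries via the sparse matrix's `nnz` attribute:

```python
# sparsity sketch: TF-IDF matrices are mostly zeros, which is why a
# 20x20 slice looks nearly empty; nnz counts the stored nonzeros
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["he was a singer", "listening to his songs", "my dad loved this song"]
X_demo = TfidfVectorizer().fit_transform(docs)
sparsity = 1.0 - X_demo.nnz / (X_demo.shape[0] * X_demo.shape[1])
print(f"sparsity: {sparsity:.2f}")
```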

3.4.2 Attribute Aggregation¶

Extract each category's raw features¶

In [75]:
from sklearn.feature_selection import VarianceThreshold
# define a function that builds a TF-IDF term-document matrix and filters out low-variance features
def create_term_document_df_TfidfVector(df, threshold=0.0, min_df=0.0, max_df=1.0):
    TFIDF_vect_temp = TfidfVectorizer(min_df=min_df, max_df=max_df)
    X_TFIDF_temp = TFIDF_vect_temp.fit_transform(df['comment'])
    selector = VarianceThreshold(threshold=threshold)
    X_selected = selector.fit_transform(X_TFIDF_temp.toarray())
    selected_features = TFIDF_vect_temp.get_feature_names_out()[selector.get_support()]
    term_document_df_temp = pd.DataFrame(X_selected, columns=selected_features)
    return term_document_df_temp
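A minimal sketch of the `VarianceThreshold` step in isolation (toy matrix, not the notebook's data): features whose variance does not exceed the threshold are dropped, so a constant column disappears even at `threshold=0.0`.

```python
# VarianceThreshold sketch (scikit-learn): the constant first column
# has zero variance and is removed; the other two columns survive
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_demo = np.array([[0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.6],
                   [0.0, 1.0, 0.4]])  # first column is constant
selector = VarianceThreshold(threshold=0.0)
X_kept = selector.fit_transform(X_demo)
print(selector.get_support())  # mask of kept columns
print(X_kept.shape)            # (3, 2)
```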
In [76]:
# find the features belonging to each of the two categories
filt_term_document_dfs_TFIDF = {} 
for category in categories:
    filt_term_document_dfs_TFIDF[category] = create_term_document_df_TfidfVector(category_dfs[category])
In [77]:
# display the sparse term-document matrices
for category in categories:
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(filt_term_document_dfs_TFIDF[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
      07   10   11  11th   12        13   14   15   16        17  ...  young  \
0    0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
1    0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.135932  ...    0.0   
2    0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
3    0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
4    0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
..   ...  ...  ...   ...  ...       ...  ...  ...  ...       ...  ...    ...   
745  0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
746  0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
747  0.0  0.0  0.0   0.0  0.0  0.225266  0.0  0.0  0.0  0.000000  ...    0.0   
748  0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   
749  0.0  0.0  0.0   0.0  0.0  0.000000  0.0  0.0  0.0  0.000000  ...    0.0   

     younger  youngster      your  yours  youth  youthful  youtube  yrs  \
0        0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
1        0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
2        0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
3        0.0        0.0  0.196577    0.0    0.0       0.0      0.0  0.0   
4        0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
..       ...        ...       ...    ...    ...       ...      ...  ...   
745      0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
746      0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
747      0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
748      0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   
749      0.0        0.0  0.000000    0.0    0.0       0.0      0.0  0.0   

          yup  
0    0.000000  
1    0.000000  
2    0.000000  
3    0.000000  
4    0.000000  
..        ...  
745  0.355567  
746  0.000000  
747  0.000000  
748  0.000000  
749  0.000000  

[750 rows x 2295 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
      00  000  045   10  100  10m   11   12   14   15  ...  youngest  \
0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
1    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
2    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
3    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
4    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...       ...   
744  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
745  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
746  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
747  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   
748  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...       0.0   

     youngsters      your  yourself  youth  youtube  yrs  yuo  zealand  zulus  
0           0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
1           0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
2           0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
3           0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
4           0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
..          ...       ...       ...    ...      ...  ...  ...      ...    ...  
744         0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
745         0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
746         0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
747         0.0  0.244427       0.0    0.0      0.0  0.0  0.0      0.0    0.0  
748         0.0  0.000000       0.0    0.0      0.0  0.0  0.0      0.0    0.0  

[749 rows x 2602 columns]
In [78]:
for category in categories:
    word_counts = filt_term_document_dfs_TFIDF[category].sum(axis=0).to_numpy()
    plt.close()
    plt.figure(figsize=(10, 6))
    plt.hist(word_counts, bins=100,color='blue', edgecolor='black')
    plt.title(f'Term Frequency Distribution for Category {category}')
    plt.xlabel('Frequency')
    plt.ylabel('Number of Terms')
    plt.show()
[figures: term-frequency histograms for the nostalgia and not nostalgia categories]

Select the more useful features in each category¶

In [79]:
select_dfs_TFIDF = {}
for category in categories:
    select_dfs_TFIDF[category] = create_term_document_df_TfidfVector(category_dfs[category], 0.0005)
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(select_dfs_TFIDF[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
      10   12        13   14   16        17   18  1963  1966  1973  ...  year  \
0    0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
1    0.0  0.0  0.000000  0.0  0.0  0.135932  0.0   0.0   0.0   0.0  ...   0.0   
2    0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
3    0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
4    0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
..   ...  ...       ...  ...  ...       ...  ...   ...   ...   ...  ...   ...   
745  0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
746  0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
747  0.0  0.0  0.225266  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
748  0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   
749  0.0  0.0  0.000000  0.0  0.0  0.000000  0.0   0.0   0.0   0.0  ...   0.0   

        years  yes  yesterday  you  young  younger      your  youth  yrs  
0    0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
1    0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
2    0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
3    0.000000  0.0        0.0  0.0    0.0      0.0  0.196577    0.0  0.0  
4    0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
..        ...  ...        ...  ...    ...      ...       ...    ...  ...  
745  0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
746  0.000000  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
747  0.109805  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
748  0.076134  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  
749  0.174062  0.0        0.0  0.0    0.0      0.0  0.000000    0.0  0.0  

[750 rows x 437 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
      16  2019   50   60  60s     about  absolutely  actress  actually  after  \
0    0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
1    0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
2    0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
3    0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
4    0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
..   ...   ...  ...  ...  ...       ...         ...      ...       ...    ...   
744  0.0   0.0  0.0  0.0  0.0  0.255993         0.0      0.0       0.0    0.0   
745  0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
746  0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
747  0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   
748  0.0   0.0  0.0  0.0  0.0  0.000000         0.0      0.0       0.0    0.0   

     ...  wow  wrong  year     years  yes  yet       you  young      your  \
0    ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.157347    0.0  0.000000   
1    ...  0.0    0.0   0.0  0.094608  0.0  0.0  0.057036    0.0  0.000000   
2    ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.458635    0.0  0.000000   
3    ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.000000    0.0  0.000000   
4    ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.000000    0.0  0.000000   
..   ...  ...    ...   ...       ...  ...  ...       ...    ...       ...   
744  ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.176040    0.0  0.000000   
745  ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.000000    0.0  0.000000   
746  ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.000000    0.0  0.000000   
747  ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.297079    0.0  0.244427   
748  ...  0.0    0.0   0.0  0.000000  0.0  0.0  0.144532    0.0  0.000000   

     youtube  
0        0.0  
1        0.0  
2        0.0  
3        0.0  
4        0.0  
..       ...  
744      0.0  
745      0.0  
746      0.0  
747      0.0  
748      0.0  

[749 rows x 435 columns]
In [80]:
# merge the feature columns of the two categories
# initialize a dict to store each category's selected feature columns
filtered_columns = {}

# collect the selected column names for each category
for category in categories:
    filtered_columns[category] = select_dfs_TFIDF[category].columns.tolist()
    print(f"Filtered columns for Category {category} and len is {len(filtered_columns[category])}:")
    print(filtered_columns[category])

# take the union of the two feature sets (features shared by both appear once)
unique_filtered_columns = set(filtered_columns[categories[0]]).union(set(filtered_columns[categories[1]]))
merged_filtered_columns = list(unique_filtered_columns)

# show the merged feature column names
print(f"Merged unique filtered columns and len is {len(merged_filtered_columns)}:")
print(merged_filtered_columns)
Filtered columns for Category nostalgia and len is 437:
['10', '12', '13', '14', '16', '17', '18', '1963', '1966', '1973', '1975', '20', '2018', '2019', '30', '40', '50', '50s', '55', '56', '60', '60s', '70', '70s', '80', '80s', '90', 'about', 'absolutely', 'actually', 'adore', 'after', 'afternoon', 'again', 'age', 'ago', 'album', 'alive', 'all', 'almost', 'always', 'am', 'amazing', 'an', 'and', 'another', 'anymore', 'are', 'around', 'artists', 'as', 'at', 'ate', 'away', 'awesome', 'back', 'be', 'beautiful', 'because', 'been', 'before', 'being', 'best', 'better', 'big', 'billy', 'bless', 'born', 'both', 'boy', 'boyfriend', 'brenda', 'brilliant', 'bring', 'bringing', 'brings', 'brother', 'brought', 'but', 'by', 'came', 'can', 'car', 'carl', 'cassette', 'changed', 'child', 'childhood', 'classic', 'clearly', 'club', 'come', 'coming', 'could', 'country', 'cry', 'crying', 'dad', 'daddy', 'damn', 'dance', 'danced', 'dancing', 'date', 'day', 'days', 'deceased', 'definitely', 'did', 'didn', 'died', 'do', 'don', 'during', 'each', 'early', 'elvis', 'end', 'engelbert', 'era', 'especially', 'even', 'ever', 'every', 'everyday', 'everyone', 'everything', 'everytime', 'evokes', 'ex', 'eyes', 'family', 'fantastic', 'fast', 'father', 'favorite', 'favorites', 'feel', 'feeling', 'felt', 'few', 'finally', 'find', 'first', 'flies', 'for', 'forever', 'forget', 'forgot', 'friend', 'friends', 'from', 'full', 'germany', 'get', 'getting', 'girl', 'girlfriend', 'girls', 'glad', 'go', 'god', 'gone', 'good', 'got', 'grade', 'grandma', 'grandmother', 'grandparents', 'grannys', 'great', 'grew', 'group', 'grow', 'growing', 'had', 'hahaha', 'happiest', 'happy', 'hard', 'has', 'have', 'he', 'hear', 'heard', 'hearing', 'heart', 'heaven', 'her', 'here', 'high', 'him', 'his', 'hit', 'holiday', 'house', 'how', 'humperdinck', 'if', 'in', 'into', 'is', 'it', 'its', 'jim', 'july', 'june', 'just', 'karen', 'kid', 'kind', 'know', 'lady', 'lane', 'last', 'late', 'later', 'laura', 'learned', 'left', 'life', 'like', 'liked', 'listen', 'listened', 'listening', 
'little', 'live', 'lol', 'lonely', 'long', 'looking', 'lot', 'lots', 'love', 'loved', 'lovely', 'lyrics', 'machine', 'made', 'make', 'makes', 'mama', 'man', 'many', 'marvellous', 'marvelous', 'mary', 'me', 'memories', 'memory', 'met', 'mid', 'mind', 'mine', 'miracles', 'miss', 'missed', 'mom', 'moments', 'more', 'morning', 'most', 'mother', 'much', 'mum', 'mums', 'music', 'my', 'name', 'need', 'never', 'new', 'nice', 'night', 'no', 'nostalgia', 'nostalgic', 'not', 'nothing', 'now', 'of', 'oh', 'old', 'older', 'oldies', 'omg', 'on', 'once', 'one', 'only', 'or', 'our', 'out', 'over', 'parents', 'part', 'party', 'passed', 'past', 'people', 'pictures', 'pilot', 'play', 'played', 'player', 'playing', 'please', 'posting', 'radio', 'real', 'really', 'record', 'reeves', 'remember', 'remembered', 'remembering', 'remind', 'reminded', 'reminds', 'reminiscing', 'right', 'rip', 'rock', 'sad', 'same', 'sang', 'saturday', 'say', 'school', 'sears', 'see', 'senior', 'sentimentality', 'she', 'simpler', 'since', 'sing', 'singer', 'singing', 'sister', 'skating', 'sleep', 'so', 'some', 'someone', 'song', 'songs', 'soul', 'sounds', 'special', 'still', 'such', 'summer', 'sunday', 'sure', 'sweet', 'take', 'takes', 'tape', 'tears', 'teen', 'teenager', 'than', 'thank', 'thanks', 'that', 'thats', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'things', 'think', 'this', 'those', 'though', 'thought', 'time', 'timeless', 'times', 'to', 'today', 'too', 'top', 'track', 'tune', 'tv', 'understand', 'until', 'up', 'us', 'usa', 'use', 'used', 'very', 'video', 'voice', 'want', 'was', 'wasn', 'way', 'we', 'wedding', 'well', 'were', 'what', 'when', 'whenever', 'where', 'which', 'while', 'who', 'why', 'will', 'wish', 'with', 'woman', 'wonderful', 'words', 'world', 'would', 'wow', 'ya', 'year', 'years', 'yes', 'yesterday', 'you', 'young', 'younger', 'your', 'youth', 'yrs']
Filtered columns for Category not nostalgia and len is 435:
['16', '2019', '50', '60', '60s', 'about', 'absolutely', 'actress', 'actually', 'after', 'again', 'age', 'ago', 'agree', 'all', 'almost', 'also', 'always', 'am', 'amazing', 'an', 'and', 'another', 'any', 'anybody', 'anymore', 'anyone', 'anything', 'appreciate', 'appreciated', 'are', 'around', 'artist', 'as', 'at', 'awesome', 'baby', 'back', 'background', 'bad', 'band', 'be', 'beat', 'beautiful', 'beauty', 'because', 'been', 'before', 'believe', 'best', 'better', 'billy', 'bit', 'bless', 'born', 'bought', 'boy', 'break', 'brenda', 'brilliant', 'brothers', 'but', 'by', 'called', 'came', 'can', 'certainly', 'change', 'childhood', 'class', 'classic', 'classics', 'close', 'come', 'comma', 'comment', 'compose', 'concert', 'could', 'country', 'course', 'crap', 'cry', 'crying', 'dance', 'dancing', 'daughter', 'day', 'days', 'dear', 'did', 'didn', 'different', 'disco', 'do', 'does', 'don', 'done', 'down', 'dynamite', 'early', 'earth', 'else', 'elvis', 'emotion', 'end', 'englebert', 'english', 'enjoy', 'era', 'especially', 'even', 'ever', 'every', 'everyday', 'everyone', 'everything', 'eyes', 'falling', 'family', 'fantastic', 'favorite', 'favorites', 'feel', 'feeling', 'female', 'few', 'find', 'first', 'for', 'forever', 'forget', 'found', 'friend', 'from', 'full', 'future', 'generation', 'generations', 'get', 'girl', 'give', 'glad', 'go', 'god', 'goes', 'gold', 'golden', 'gone', 'gonna', 'good', 'gorgeous', 'got', 'great', 'greatest', 'grew', 'guy', 'guys', 'had', 'handsome', 'hank', 'happened', 'happy', 'has', 'have', 'he', 'head', 'hear', 'heard', 'hearing', 'heart', 'heaven', 'her', 'here', 'him', 'his', 'history', 'hit', 'hits', 'home', 'hope', 'how', 'if', 'images', 'in', 'into', 'intro', 'irreplaceable', 'is', 'it', 'its', 'just', 'keep', 'kind', 'king', 'know', 'lady', 'last', 'late', 'laura', 'learn', 'least', 'leave', 'left', 'legend', 'let', 'life', 'like', 'listen', 'listened', 'listening', 'little', 'live', 'lived', 'll', 'lol', 'lonely', 'long', 'look', 
'looking', 'looks', 'loss', 'lost', 'lot', 'love', 'loved', 'lovely', 'loving', 'lyrics', 'made', 'magnificent', 'make', 'makes', 'man', 'mans', 'many', 'masterpiece', 'matter', 'may', 'me', 'mean', 'meaning', 'melody', 'men', 'mind', 'miss', 'mom', 'moment', 'more', 'most', 'movie', 'much', 'music', 'my', 'na', 'name', 'never', 'new', 'nice', 'no', 'not', 'nothing', 'now', 'nowadays', 'of', 'off', 'oh', 'old', 'on', 'once', 'one', 'ones', 'only', 'or', 'original', 'others', 'our', 'out', 'over', 'paint', 'parents', 'part', 'peace', 'people', 'perfect', 'performance', 'person', 'pictures', 'play', 'played', 'please', 'pleasure', 'pop', 'posting', 'prefer', 'pretty', 'pure', 'put', 'radio', 're', 'read', 'real', 'really', 'record', 'recordings', 'remains', 'rest', 'right', 'rock', 'roll', 'romantic', 'roy', 'sad', 'same', 'sang', 'say', 'says', 'screen', 'see', 'seen', 'sharing', 'she', 'should', 'sing', 'singer', 'singers', 'singing', 'single', 'sings', 'so', 'some', 'someone', 'something', 'song', 'songs', 'sorrow', 'soul', 'sound', 'sounds', 'special', 'stars', 'started', 'still', 'such', 'sung', 'supernatural', 'sure', 'take', 'talented', 'taste', 'tears', 'tell', 'teresa', 'terrific', 'than', 'thank', 'thanks', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'this', 'those', 'though', 'thought', 'time', 'timeless', 'times', 'titans', 'to', 'today', 'told', 'too', 'touching', 'true', 'truly', 'tune', 'understand', 'unique', 'until', 'untouchable', 'up', 'us', 'used', 've', 'version', 'very', 'video', 'vocals', 'voice', 'voices', 'wake', 'want', 'was', 'way', 'we', 'well', 'were', 'what', 'when', 'where', 'wherever', 'which', 'who', 'why', 'wife', 'will', 'wish', 'wished', 'with', 'without', 'woman', 'wonder', 'wonderful', 'words', 'work', 'world', 'would', 'wow', 'wrong', 'year', 'years', 'yes', 'yet', 'you', 'young', 'your', 'youtube']
Merged unique filtered columns and len is 602:
['left', 'were', '18', 'germany', 'listening', 'afternoon', 'unique', 'age', 'school', 'humperdinck', 'just', 'earth', 'says', '70s', 'recordings', 'nothing', 'adore', 'greatest', 'lovely', 'life', 'mind', 'its', 'miracles', 'away', 'many', 'bring', '2018', 'singer', 'once', 'no', '55', 'pure', 'beauty', 'player', 'hard', 'about', 'bringing', 'senior', 'roll', 've', 'certainly', 'loved', 'truly', 'performance', 'lane', 'playing', 'and', 'each', 'pop', 'old', 'dear', '70', 'paint', 'men', 'new', 'us', 'never', 'wrong', 'beautiful', 'ever', 'our', 'screen', 'get', 'crap', 'roy', 'single', 'saturday', 'early', 'here', 'legend', 'but', 'teenager', 'sweet', 'takes', 'grannys', 'sung', 'sears', 'wonderful', 'do', 'woman', 'billy', 'ex', 'voice', 'morning', 'skating', 'thing', 'top', 'goes', 'moments', 'feel', 'sure', '16', 'dynamite', 'images', 'wow', 'machine', 'hear', 'heard', 'know', 'had', 'child', 'pilot', 'older', 'to', 'year', 'anymore', 'high', 'mums', 'na', 'grandmother', 'wished', 'listen', 'mine', 'radio', 'history', 'makes', 'another', 'if', 'got', 'at', 'friend', 'great', 'such', 'nostalgic', 'party', 'country', 'person', 'almost', 'enjoy', 'supernatural', 'something', 'always', 'actress', 'songs', 'happened', 'have', 'they', 'singers', 'without', 'intro', 'best', 'dancing', 'leave', 'lived', 'loss', 'mean', 'beat', 'kid', 'we', 'grew', 'englebert', 'melody', 'too', 'still', 'oldies', '10', 'sad', 'baby', 'where', 'hope', 'since', 'close', 'real', 'generation', 'voices', 'handsome', 'understand', 'find', '50', 'video', 'should', 'remind', 'lady', 'reminiscing', 'matter', 'eyes', 'days', 'anybody', 'why', 'memory', 'usa', 'comment', 'it', 'hahaha', 'their', 'grandparents', 'guys', 'others', 'ones', 'she', 'is', 'album', 'even', 'bless', 'brothers', 'generations', 'reminds', 'thank', 'into', 'today', 'now', 'happiest', 'around', 'else', 'terrific', 'also', 'missed', 'let', 'karen', 'keep', 'not', 'mary', 'sister', 'class', 'mother', 'every', 'when', 'part', 
'younger', '1966', 'yrs', 'lonely', 'disco', 'very', 'pictures', '13', 'much', 'the', 'forgot', 'whenever', 'fast', 'world', 'version', 'deceased', 'emotion', 'didn', 'give', 'in', 'course', 'say', 'music', 'movie', 'flies', 'found', 'day', 'your', 're', 'because', 'summer', 'talented', 'that', 'ate', 'most', 'looks', 'end', 'only', 'yesterday', 'group', 'first', 'hank', 'gonna', 'good', '1963', 'name', 'awesome', 'right', 'parents', 'date', '14', '2019', 'forever', 'everyone', 'simpler', 'time', 'mama', 'before', 'an', 'same', 'cassette', 'how', 'feeling', 'believe', 'looking', 'gorgeous', 'this', 'any', '56', 'which', 'omg', 'hits', 'man', 'mum', 'wasn', '40', 'teen', 'bought', 'favorite', 'original', 'everytime', 'girl', 'brenda', 'peace', 'track', 'make', 'read', 'remains', 'yes', 'glad', 'anything', 'wish', 'them', 'her', 'youtube', 'ya', 'grade', 'born', 'clearly', 'tape', 'lol', 'learn', 'club', 'daughter', 'see', 'liked', 'happy', 'or', 'crying', 'need', 'appreciate', 'pretty', 'sharing', 'least', 'times', 'back', 'lost', 'lyrics', 'after', 'sounds', 'full', 'up', 'he', 'like', 'touching', 'heaven', 'head', 'met', 'wherever', 'felt', 'favorites', 'appreciated', 'brother', 'by', 'thanks', 'family', 'marvellous', 'passed', 'rip', 'few', 'holiday', 'years', 'magnificent', 'later', 'take', 'rock', 'go', 'one', 'can', 'father', 'titans', 'ago', 'singing', 'been', 'you', 'teresa', 'thought', 'out', 'than', 'put', 'nice', 'untouchable', 'did', 'anyone', 'heart', 'want', 'being', 'bit', 'pleasure', 'brings', 'tv', 'compose', 'especially', 'all', 'better', 'house', 'those', 'future', 'july', 'stars', 'sing', 'me', 'laura', 'will', 'masterpiece', 'who', 'absolutely', 'his', 'prefer', 'remembering', 'be', 'love', 'people', 'concert', 'think', 'wedding', 'during', 'then', 'don', 'everything', 'miss', 'grandma', 'tune', 'work', '80', 'll', 'boy', 'falling', 'irreplaceable', 'learned', 'perfect', 'brought', 'words', 'female', 'though', 'as', '80s', 'brilliant', 'really', 
'record', 'youth', 'late', 'was', 'well', 'posting', 'mom', 'june', 'mid', 'wife', 'sunday', 'reeves', '60s', 'has', 'bad', 'coming', 'growing', '17', 'home', 'someone', 'classics', 'god', 'died', 'artists', 'big', 'definitely', 'last', 'made', 'what', 'things', 'soul', 'him', 'loving', 'look', 'young', 'please', 'little', 'use', '60', 'listened', 'over', 'on', 'am', 'used', 'childhood', 'cry', 'of', 'moment', 'carl', 'night', 'play', 'lot', 'background', 'off', 'reminded', 'girlfriend', 'vocals', 'tell', 'more', '1975', 'long', 'done', 'change', 'come', 'called', 'wonder', 'may', 'romantic', 'remember', 'nowadays', 'could', 'hearing', 'my', 'english', 'alive', 'oh', 'special', 'changed', 'actually', 'thats', '20', 'kind', 'sorrow', 'getting', 'daddy', 'friends', 'tears', 'yet', 'there', 'again', 'elvis', 'taste', 'memories', '12', 'girls', 'live', 'sentimentality', 'are', 'sang', 'danced', 'mans', 'guy', 'car', 'sleep', 'down', 'told', 'would', 'past', 'lots', 'these', 'until', 'golden', 'agree', 'seen', 'song', 'jim', 'classic', 'break', 'remembered', 'grow', 'does', 'for', 'band', 'gone', 'sound', 'with', '90', 'era', 'hit', 'so', 'artist', 'comma', 'true', '30', 'king', 'while', 'engelbert', 'everyday', 'some', 'dance', 'played', 'nostalgia', 'timeless', 'amazing', 'finally', 'boyfriend', 'came', 'meaning', 'way', 'both', '50s', 'from', 'marvelous', 'sings', 'forget', 'different', 'fantastic', 'started', 'damn', 'wake', 'gold', '1973', 'dad', 'rest', 'evokes']
In [82]:
vectorizer_combined = TfidfVectorizer(vocabulary=merged_filtered_columns)
tfidf_combined_matrix = vectorizer_combined.fit_transform(X.comment)
tfidf_combined_array = tfidf_combined_matrix.toarray()
combined_df_TFIDF = pd.DataFrame(tfidf_combined_array, columns=vectorizer_combined.get_feature_names_out())
print(combined_df_TFIDF)
      left      were   18  germany  listening  afternoon  unique       age  \
0      0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.176167   
1      0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.000000   
2      0.0  0.125393  0.0      0.0   0.000000        0.0     0.0  0.150142   
3      0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.000000   
4      0.0  0.000000  0.0      0.0   0.280711        0.0     0.0  0.000000   
...    ...       ...  ...      ...        ...        ...     ...       ...   
1494   0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.000000   
1495   0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.000000   
1496   0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.309594   
1497   0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.350538   
1498   0.0  0.000000  0.0      0.0   0.000000        0.0     0.0  0.000000   

      school  humperdinck  ...  different  fantastic   started  damn  wake  \
0        0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
1        0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
2        0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
3        0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
4        0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
...      ...          ...  ...        ...        ...       ...   ...   ...   
1494     0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
1495     0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
1496     0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   
1497     0.0          0.0  ...        0.0        0.0  0.212079   0.0   0.0   
1498     0.0          0.0  ...        0.0        0.0  0.000000   0.0   0.0   

      gold  1973  dad  rest  evokes  
0      0.0   0.0  0.0   0.0     0.0  
1      0.0   0.0  0.0   0.0     0.0  
2      0.0   0.0  0.0   0.0     0.0  
3      0.0   0.0  0.0   0.0     0.0  
4      0.0   0.0  0.0   0.0     0.0  
...    ...   ...  ...   ...     ...  
1494   0.0   0.0  0.0   0.0     0.0  
1495   0.0   0.0  0.0   0.0     0.0  
1496   0.0   0.0  0.0   0.0     0.0  
1497   0.0   0.0  0.0   0.0     0.0  
1498   0.0   0.0  0.0   0.0     0.0  

[1499 rows x 602 columns]

Building augmented_df¶

In [83]:
augmented_df_combined = pd.concat([TFIDF_df, combined_df_TFIDF], axis=1) 
augmented_df_combined
Out[83]:
00 000 045 07 10 100 10m 11 11th 12 ... different fantastic started damn wake gold 1973 dad rest evokes
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1494 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1495 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1496 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1497 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.212079 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1498 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1499 rows × 4332 columns
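One thing to watch with this step: `pd.concat(..., axis=1)` silently keeps duplicate column names when `TFIDF_df` and `combined_df_TFIDF` share vocabulary. A minimal sketch (with hypothetical toy frames standing in for the two TF-IDF frames) of detecting and collapsing such duplicates:

```python
import pandas as pd

# Two toy TF-IDF-style frames with an overlapping column ("love"),
# standing in for TFIDF_df and combined_df_TFIDF (hypothetical values)
a = pd.DataFrame({"love": [0.5, 0.0], "song": [0.0, 0.7]})
b = pd.DataFrame({"love": [0.3, 0.0], "voice": [0.9, 0.0]})

merged = pd.concat([a, b], axis=1)
dup = merged.columns[merged.columns.duplicated()].tolist()
print(dup)  # ['love'] -- the shared column appears twice

# One way to collapse duplicates: keep the column-wise maximum
dedup = merged.T.groupby(level=0).max().T
print(sorted(dedup.columns))  # ['love', 'song', 'voice']
```

Whether to merge duplicate columns (here via the column-wise maximum) or keep both copies depends on how the downstream model should treat repeated features.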

3.3.4 Dimensionality Reduction¶

2D by PCA, t-SNE, UMAP¶

In [84]:
X_pca, X_tsne, X_umap = Dimensionality_2D(augmented_df_combined)
In [85]:
draw_2D_plt(X_pca, X_tsne, X_umap)
[Figure: 2-D scatter plots of the PCA, t-SNE, and UMAP projections]

2D by Isomap, MDS¶

In [92]:
from sklearn.manifold import Isomap, MDS

def Dimensionality_2D_new(now_df):
    X_isomap = Isomap(n_components=2).fit_transform(now_df.values)
    X_mds = MDS(n_components=2).fit_transform(now_df.values)
    return X_isomap, X_mds

def draw_2D_plt_new(X_isomap, X_mds, categories):
    plt.close()
    fig, axes = plt.subplots(1, 2, figsize=(20, 10))  
    fig.suptitle('Isomap and MDS Comparison')
    # plot each dimensionality-reduction result
    plot_scatter(axes[0], X_isomap, 'Isomap')
    plot_scatter(axes[1], X_mds, 'MDS')
    
    plt.show()
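As a quick sanity check of the two new reducers, a self-contained sketch on toy data (a random matrix standing in for the TF-IDF frame; shapes only, no plotting). Note that `Isomap`'s `n_neighbors` must be smaller than the sample count:

```python
import numpy as np
from sklearn.manifold import Isomap, MDS

# Toy 20-sample, 5-feature matrix standing in for a TF-IDF frame
rng = np.random.default_rng(0)
toy = rng.random((20, 5))

X_iso = Isomap(n_components=2, n_neighbors=5).fit_transform(toy)
X_mds = MDS(n_components=2, random_state=0).fit_transform(toy)
print(X_iso.shape, X_mds.shape)  # both reduce 5 features to 2
```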
In [93]:
X_isomap, X_mds = Dimensionality_2D_new(TFIDF_df)
draw_2D_plt_new(X_isomap, X_mds, categories)
[Figure: Isomap and MDS 2-D projections of TFIDF_df]
In [94]:
X_isomap, X_mds = Dimensionality_2D_new(augmented_df_combined)
draw_2D_plt_new(X_isomap, X_mds, categories)
[Figure: Isomap and MDS 2-D projections of augmented_df_combined]

4. Data Exploration¶

In [95]:
# Retrieve the comment text of three fixed records (indices 50, 100, 150)
document_to_transform_1 = []
random_record_1 = X.iloc[50]
random_record_1 = random_record_1['comment']
document_to_transform_1.append(random_record_1)

document_to_transform_2 = []
random_record_2 = X.iloc[100]
random_record_2 = random_record_2['comment']
document_to_transform_2.append(random_record_2)

document_to_transform_3 = []
random_record_3 = X.iloc[150]
random_record_3 = random_record_3['comment']
document_to_transform_3.append(random_record_3)
In [96]:
print(document_to_transform_1)
print(document_to_transform_2)
print(document_to_transform_3)
['If I remember correctly, this song came out after Mr. Reeves passed away. I was about 10 years old when the disc jockey said that the news just came over the wire that he died in a plane crash.']
['i guess most of us leave it too late before we tell someone just how much we really love them']
['my name is thomas but know by tommy and my wifes name is laura and i always sing this to her']
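The three near-identical retrieval blocks above could also be collapsed into one loop; a sketch on a hypothetical toy frame standing in for `X`:

```python
import pandas as pd

# Toy frame standing in for X (hypothetical comments)
X_toy = pd.DataFrame({"comment": [f"comment {i}" for i in range(200)]})

# Wrap each comment in a one-element list, since the vectorizer
# expects a sequence of documents
docs = [[X_toy.iloc[i]["comment"]] for i in (50, 100, 150)]
print(docs[0])  # ['comment 50']
```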
In [97]:
from sklearn.preprocessing import binarize

# Transform each comment with the fitted CountVectorizer
document_vector_count_1 = count_vect.transform(document_to_transform_1)
document_vector_count_2 = count_vect.transform(document_to_transform_2)
document_vector_count_3 = count_vect.transform(document_to_transform_3)

# Binarize vectors to simplify: 0 for absence, 1 for presence
document_vector_count_1_bin = binarize(document_vector_count_1)
document_vector_count_2_bin = binarize(document_vector_count_2)
document_vector_count_3_bin = binarize(document_vector_count_3)

# print vectors
print("Let's take a look at the count vectors:")
print(document_vector_count_1.todense())
print(document_vector_count_2.todense())
print(document_vector_count_3.todense())
Let's take a look at the count vectors:
[[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]]
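To see what `binarize` does to a count vector, a toy sketch with a two-document corpus (hypothetical; the notebook's `count_vect` behaves the same way on the real comments):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import binarize

# Toy corpus; vocabulary sorts to ['love', 'song', 'this', 'voice']
docs = ["love love this song", "this voice"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)

print(counts.toarray())            # raw counts: "love" appears twice in doc 1
print(binarize(counts).toarray())  # 1 for presence, 0 for absence
```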
In [98]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate Cosine Similarity
cos_sim_count_1_2 = cosine_similarity(document_vector_count_1, document_vector_count_2, dense_output=True)
cos_sim_count_1_3 = cosine_similarity(document_vector_count_1, document_vector_count_3, dense_output=True)
cos_sim_count_2_3 = cosine_similarity(document_vector_count_2, document_vector_count_3, dense_output=True)

cos_sim_count_1_1 = cosine_similarity(document_vector_count_1, document_vector_count_1, dense_output=True)
cos_sim_count_2_2 = cosine_similarity(document_vector_count_2, document_vector_count_2, dense_output=True)
cos_sim_count_3_3 = cosine_similarity(document_vector_count_3, document_vector_count_3, dense_output=True)

# Print the cosine similarity values
print("Cosine Similarity using count between 1 and 2: %.4f" % cos_sim_count_1_2[0][0])
print("Cosine Similarity using count between 1 and 3: %.4f" % cos_sim_count_1_3[0][0])
print("Cosine Similarity using count between 2 and 3: %.4f" % cos_sim_count_2_3[0][0])

print("Cosine Similarity using count between 1 and 1: %.4f" % cos_sim_count_1_1[0][0])
print("Cosine Similarity using count between 2 and 2: %.4f" % cos_sim_count_2_2[0][0])
print("Cosine Similarity using count between 3 and 3: %.4f" % cos_sim_count_3_3[0][0])
Cosine Similarity using count between 1 and 2: 0.0322
Cosine Similarity using count between 1 and 3: 0.0279
Cosine Similarity using count between 2 and 3: 0.0000
Cosine Similarity using count between 1 and 1: 1.0000
Cosine Similarity using count between 2 and 2: 1.0000
Cosine Similarity using count between 3 and 3: 1.0000
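The values above can be reproduced by hand: cosine similarity is the dot product divided by the product of the vector norms, which is why every vector has similarity 1.0 with itself. A toy check with made-up vectors:

```python
import numpy as np

# cos(u, v) = (u . v) / (||u|| * ||v||)
u = np.array([2.0, 1.0, 0.0])
v = np.array([1.0, 0.0, 1.0])

cos_uv = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
cos_uu = u @ u / (np.linalg.norm(u) ** 2)
print(round(cos_uv, 4))  # 0.6325
print(round(cos_uu, 4))  # 1.0 -- self-similarity is always 1
```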

5. Data Classification¶

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

category_mapping = dict(X[['sentiment', 'sentiment_name']].drop_duplicates().values)
print(category_mapping)

target_names = [category_mapping[label] for label in sorted(category_mapping.keys())]
print(target_names)
{1: 'not nostalgia', 0: 'nostalgia'}
['nostalgia', 'not nostalgia']
In [100]:
def Bernoulli_model(X_train, X_test, y_train, y_test):
    model = BernoulliNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)
In [101]:
def Multinomial_model(X_train, X_test, y_train, y_test):
    model = MultinomialNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)
In [102]:
def Gaussian_model(X_train, X_test, y_train, y_test):
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)
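The three model functions above differ only in the class they instantiate; a sketch of one parameterized helper (hypothetical name `evaluate_nb`, demonstrated on random toy data) that could replace all three:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, accuracy_score

def evaluate_nb(model_cls, X_train, X_test, y_train, y_test, target_names):
    """Fit one Naive Bayes variant and print its accuracy and report."""
    model = model_cls()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification report:\n",
          classification_report(y_test, y_pred,
                                target_names=target_names, digits=4))
    return model

# Toy usage on random binary features (illustrative only)
rng = np.random.default_rng(0)
Xd = rng.integers(0, 2, size=(40, 6))
yd = np.array([0, 1] * 20)  # alternate labels so both classes appear in the test split
m = evaluate_nb(BernoulliNB, Xd[:30], Xd[30:], yd[:30], yd[30:],
                ["nostalgia", "not nostalgia"])
```

Each of the cells below would then become a one-line call such as `evaluate_nb(MultinomialNB, X_train, X_test, y_train, y_test, target_names)`.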

5.1 CountVectorizer¶

5.1.1 tdm_df¶

In [103]:
X_train, X_test, y_train, y_test = train_test_split(tdm_df, X['sentiment'], test_size=0.3, random_state=2)
In [104]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8756    0.8955    0.8854       220
not nostalgia     0.8978    0.8783    0.8879       230

     accuracy                         0.8867       450
    macro avg     0.8867    0.8869    0.8867       450
 weighted avg     0.8869    0.8867    0.8867       450

In [105]:
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8711111111111111
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8293    0.9273    0.8755       220
not nostalgia     0.9216    0.8174    0.8664       230

     accuracy                         0.8711       450
    macro avg     0.8754    0.8723    0.8709       450
 weighted avg     0.8764    0.8711    0.8708       450

In [106]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.6666666666666666
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.6144    0.8545    0.7148       220
not nostalgia     0.7778    0.4870    0.5989       230

     accuracy                         0.6667       450
    macro avg     0.6961    0.6708    0.6569       450
 weighted avg     0.6979    0.6667    0.6556       450

5.1.2 augmented_df_minsup¶

In [107]:
X_train, X_test, y_train, y_test = train_test_split(augmented_df_minsup, X['sentiment'], test_size=0.3, random_state=2)
In [108]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.9044444444444445
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.9116    0.8909    0.9011       220
not nostalgia     0.8979    0.9174    0.9075       230

     accuracy                         0.9044       450
    macro avg     0.9048    0.9042    0.9043       450
 weighted avg     0.9046    0.9044    0.9044       450

In [109]:
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8477    0.9364    0.8898       220
not nostalgia     0.9324    0.8391    0.8833       230

     accuracy                         0.8867       450
    macro avg     0.8901    0.8877    0.8866       450
 weighted avg     0.8910    0.8867    0.8865       450

In [110]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7733333333333333
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.7500    0.8045    0.7763       220
not nostalgia     0.7991    0.7435    0.7703       230

     accuracy                         0.7733       450
    macro avg     0.7745    0.7740    0.7733       450
 weighted avg     0.7751    0.7733    0.7732       450

5.1.3 augmented_df_topK¶

In [111]:
X_train, X_test, y_train, y_test = train_test_split(augmented_df_topK, X['sentiment'], test_size=0.3, random_state=2)
In [112]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.9044444444444445
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.9078    0.8955    0.9016       220
not nostalgia     0.9013    0.9130    0.9071       230

     accuracy                         0.9044       450
    macro avg     0.9046    0.9042    0.9044       450
 weighted avg     0.9045    0.9044    0.9044       450

In [113]:
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8777777777777778
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8340    0.9364    0.8822       220
not nostalgia     0.9310    0.8217    0.8730       230

     accuracy                         0.8778       450
    macro avg     0.8825    0.8791    0.8776       450
 weighted avg     0.8836    0.8778    0.8775       450

In [114]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7111111111111111
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.6531    0.8727    0.7471       220
not nostalgia     0.8205    0.5565    0.6632       230

     accuracy                         0.7111       450
    macro avg     0.7368    0.7146    0.7051       450
 weighted avg     0.7386    0.7111    0.7042       450

5.1.4 augmented_df_max¶

In [115]:
X_train, X_test, y_train, y_test = train_test_split(augmented_df_max, X['sentiment'], test_size=0.3, random_state=2)
In [116]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8911111111111111
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8977    0.8773    0.8874       220
not nostalgia     0.8851    0.9043    0.8946       230

     accuracy                         0.8911       450
    macro avg     0.8914    0.8908    0.8910       450
 weighted avg     0.8913    0.8911    0.8911       450

Multinomial_model(X_train, X_test, y_train, y_test)

In [117]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7511111111111111
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.7030    0.8500    0.7695       220
not nostalgia     0.8207    0.6565    0.7295       230

     accuracy                         0.7511       450
    macro avg     0.7618    0.7533    0.7495       450
 weighted avg     0.7631    0.7511    0.7491       450

5.2 TFIDF¶

5.2.1 TFIDF_df (original data)¶

In [118]:
X_train, X_test, y_train, y_test = train_test_split(TFIDF_df, X['sentiment'], test_size=0.3, random_state=2)
In [119]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8756    0.8955    0.8854       220
not nostalgia     0.8978    0.8783    0.8879       230

     accuracy                         0.8867       450
    macro avg     0.8867    0.8869    0.8867       450
 weighted avg     0.8869    0.8867    0.8867       450

In [120]:
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8555555555555555
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.7992    0.9409    0.8643       220
not nostalgia     0.9319    0.7739    0.8456       230

     accuracy                         0.8556       450
    macro avg     0.8656    0.8574    0.8550       450
 weighted avg     0.8671    0.8556    0.8547       450

In [121]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.66
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.6184    0.7955    0.6958       220
not nostalgia     0.7305    0.5304    0.6146       230

     accuracy                         0.6600       450
    macro avg     0.6745    0.6629    0.6552       450
 weighted avg     0.6757    0.6600    0.6543       450

5.2.2 augmented_df_combined (using variance)¶

In [122]:
X_train, X_test, y_train, y_test = train_test_split(augmented_df_combined, X['sentiment'], test_size=0.3, random_state=2)
In [123]:
X_test
Out[123]:
00 000 045 07 10 100 10m 11 11th 12 ... different fantastic started damn wake gold 1973 dad rest evokes
1321 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
903 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.192976 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
1275 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
69 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
272 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
708 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.57711 0.0 0.0
60 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
201 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
265 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0
472 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0

450 rows × 4332 columns

In [124]:
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8955555555555555
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8914    0.8955    0.8934       220
not nostalgia     0.8996    0.8957    0.8976       230

     accuracy                         0.8956       450
    macro avg     0.8955    0.8956    0.8955       450
 weighted avg     0.8956    0.8956    0.8956       450

In [125]:
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8777777777777778
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.8367    0.9318    0.8817       220
not nostalgia     0.9268    0.8261    0.8736       230

     accuracy                         0.8778       450
    macro avg     0.8818    0.8790    0.8776       450
 weighted avg     0.8828    0.8778    0.8776       450

In [126]:
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.6688888888888889
Classification report:
                precision    recall  f1-score   support

    nostalgia     0.6263    0.8000    0.7026       220
not nostalgia     0.7396    0.5435    0.6266       230

     accuracy                         0.6689       450
    macro avg     0.6830    0.6717    0.6646       450
 weighted avg     0.6842    0.6689    0.6637       450

Comment: Bernoulli Naive Bayes consistently outperforms both Multinomial and Gaussian here. A likely reason is that most of my features appear only rarely in any single comment, so presence/absence carries most of the signal, whereas the latter two models are more easily swayed by high-frequency words.
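One way to see this effect: `BernoulliNB` thresholds each feature at its `binarize` parameter (0.0 by default), so word counts collapse to presence/absence and a single high-count word cannot dominate the decision. A toy sketch with made-up count vectors:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Two features; class 0 tends to contain word 0, class 1 word 1.
# Counts vary wildly (7 vs 3, 1 vs 2), but BernoulliNB only sees 0/1.
X_counts = np.array([[7, 0], [0, 1], [3, 0], [0, 2]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB()  # binarize=0.0 by default
clf.fit(X_counts, y)
print(clf.predict([[1, 0], [0, 9]]))  # [0 1] -- only presence matters
```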

During data processing I also applied two further methods, FAE topK and MFPGrowth, and wrapped each in its own function so it can be invoked quickly with different parameters later. FAE topK in particular lets me select the most meaningful features quickly, avoiding an excessive feature count that could interfere with model development.
